Part A: Multiple regression analysis

# Load the package
library(ggplot2)
library(GGally)
library(MASS)
library(car)
## Loading required package: carData
# Read the file
df<-read.csv("SalesPerformance.csv", header = TRUE, stringsAsFactors = FALSE)
# Regression use all the variables
RM1 <- lm(Profit ~ Area + Popn + Outlets + Commis, data = df)
summary(RM1)
## 
## Call:
## lm(formula = Profit ~ Area + Popn + Outlets + Commis, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -773.91  -93.91   25.22  108.02  505.06 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 694.6890   367.1055   1.892  0.06475 .  
## Area        -22.9564     8.2917  -2.769  0.00809 ** 
## Popn        101.7687    61.4965   1.655  0.10476    
## Outlets       0.8301     1.5524   0.535  0.59540    
## Commis      316.7449    68.4808   4.625 3.05e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 220.6 on 46 degrees of freedom
## Multiple R-squared:  0.6518, Adjusted R-squared:  0.6215 
## F-statistic: 21.52 on 4 and 46 DF,  p-value: 4.648e-10

The model above contains all four candidate variables, with Profit as the dependent variable. The summary shows an adjusted R-squared of 0.6215, which is reasonable. However, the p-values show that some variables, namely Popn and Outlets, are not significant in the model, while Commis has the largest effect on Profit.

# Plot the regression
plot(RM1)

Plot 1 shows that the residuals are not randomly distributed.
Plot 2 shows that some standardized residuals deviate from the line.
Plots 3 and 4 show that outliers such as observations 19, 32 and 47 exist.
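The four diagnostic plots are easier to compare side by side in one window; a small sketch, assuming RM1 has been fitted as above:

```r
# Arrange the four default diagnostic plots (residuals vs. fitted, normal Q-Q,
# scale-location, residuals vs. leverage) in a 2x2 grid, then restore the layout
op <- par(mfrow = c(2, 2))
plot(RM1)
par(op)
```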

  2. The combination effect of COMMIS and OUTLETS.
    (1) First, remove the variable COMMIS:
RM2 <- lm(Profit ~ Area + Popn + Outlets, data = df)
summary(RM2)
## 
## Call:
## lm(formula = Profit ~ Area + Popn + Outlets, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -640.4 -203.6   20.0  175.7  456.2 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 1396.2926   400.3114   3.488  0.00107 **
## Area         -30.0000     9.7601  -3.074  0.00351 **
## Popn          52.1988    72.5125   0.720  0.47518   
## Outlets       -0.4619     1.8286  -0.253  0.80169   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 264.2 on 47 degrees of freedom
## Multiple R-squared:  0.4898, Adjusted R-squared:  0.4572 
## F-statistic: 15.04 on 3 and 47 DF,  p-value: 5.381e-07
plot(RM2)

    (2) Then, remove the variable OUTLETS:

RM3 <- lm(Profit ~ Area + Popn + Commis, data = df)
summary(RM3)
## 
## Call:
## lm(formula = Profit ~ Area + Popn + Commis, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -751.69  -90.93   21.43  101.61  499.67 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  792.701    315.650   2.511  0.01552 *  
## Area         -23.301      8.204  -2.840  0.00664 ** 
## Popn         116.444     54.613   2.132  0.03825 *  
## Commis       310.156     66.849   4.640 2.81e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 218.9 on 47 degrees of freedom
## Multiple R-squared:  0.6496, Adjusted R-squared:  0.6272 
## F-statistic: 29.04 on 3 and 47 DF,  p-value: 8.984e-11
plot(RM3)

    (3) Finally, try a model with only COMMIS and OUTLETS:
RM4 <- lm(Profit ~ Outlets + Commis, data = df)
summary(RM4)
## 
## Call:
## lm(formula = Profit ~ Outlets + Commis, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -815.56 -169.86   13.72  154.83  622.32 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -298.425    267.366  -1.116 0.269907    
## Outlets        6.962      1.392   5.001 8.04e-06 ***
## Commis       329.694     87.125   3.784 0.000428 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.1 on 48 degrees of freedom
## Multiple R-squared:  0.3889, Adjusted R-squared:  0.3634 
## F-statistic: 15.27 on 2 and 48 DF,  p-value: 7.359e-06
plot(RM4)

The combination effect of COMMIS and OUTLETS can be evaluated from the summary tables and plots of the regression models above. First, the adjusted R-squared is highest in RM3 (0.6272), the model that removes OUTLETS; it is even higher than in RM1 (0.6215).

The results indicate that the proportion of variance in the dependent variable explained by the independent variables decreases when OUTLETS is included. However, the difference between the adjusted R-squared values is small, so we cannot conclude that the variables have a negative effect on each other.

From the plots of RM1, RM2 and RM3, it is clear that with both COMMIS and OUTLETS included, the model fits the data better. Moreover, in RM1 the estimates of OUTLETS and COMMIS are 0.8301 and 316.7449, both larger in absolute value than the corresponding estimates in RM2 and RM3 (-0.4619 and 310.156). Also, in RM4, with only COMMIS and OUTLETS as predictors, both p-values are highly significant and the residuals are randomly distributed. We can therefore conclude that the two variables do have a combination effect.
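The fit statistics quoted above can be collected into one table for easier comparison; a sketch, assuming RM1 through RM4 have been fitted as above:

```r
# Compare adjusted R-squared and residual standard error across the four models
models <- list(RM1 = RM1, RM2 = RM2, RM3 = RM3, RM4 = RM4)
data.frame(
  adj.r.squared = sapply(models, function(m) summary(m)$adj.r.squared),
  resid.std.err = sapply(models, function(m) summary(m)$sigma)
)
```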

  3. Use Profit/Outlets as the dependent variable.
# Add a new column 
df$Profit_of_outlets<-(df$Profit)/(df$Outlets)
RM5 <- lm(Profit_of_outlets ~ Popn + Commis + Area, data = df)

summary(RM5)
## 
## Call:
## lm(formula = Profit_of_outlets ~ Popn + Commis + Area, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9040 -0.6038  0.0779  0.8308  2.4466 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.0652     1.8471   3.825 0.000384 ***
## Popn          0.0258     0.3196   0.081 0.935995    
## Commis        2.0420     0.3912   5.220 3.98e-06 ***
## Area         -0.1536     0.0480  -3.200 0.002462 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.281 on 47 degrees of freedom
## Multiple R-squared:  0.5527, Adjusted R-squared:  0.5242 
## F-statistic: 19.36 on 3 and 47 DF,  p-value: 2.58e-08
summary(RM1)
## 
## Call:
## lm(formula = Profit ~ Area + Popn + Outlets + Commis, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -773.91  -93.91   25.22  108.02  505.06 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 694.6890   367.1055   1.892  0.06475 .  
## Area        -22.9564     8.2917  -2.769  0.00809 ** 
## Popn        101.7687    61.4965   1.655  0.10476    
## Outlets       0.8301     1.5524   0.535  0.59540    
## Commis      316.7449    68.4808   4.625 3.05e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 220.6 on 46 degrees of freedom
## Multiple R-squared:  0.6518, Adjusted R-squared:  0.6215 
## F-statistic: 21.52 on 4 and 46 DF,  p-value: 4.648e-10

Comparing the summary tables of RM1 and RM5, I find that although the adjusted R-squared decreases, the Pr(>|t|) values of Area and Commis decrease, suggesting a higher level of significance; Popn, however, becomes clearly insignificant (0.936).

plot(RM5)

plot(RM1)

Plot 1 shows that the residuals are more randomly distributed for RM5. Plot 2 shows that the standardized residuals follow the line more closely for RM5.

The analysis above concludes that PROFIT divided by OUTLETS works better as a dependent variable than PROFIT itself.

Part B (25 marks): Omit variables that show little predictive value. Recall the models above: three variables remain, COMMIS, POPN and AREA. Multicollinearity should be examined.

vif(RM5)
##     Popn   Commis     Area 
## 3.674139 1.111555 3.470757
X <- subset(df,select=-c(Profit_of_outlets,Profit,Outlets,Dist)) 
ggpairs(X)

The chart above shows that the variables POPN and AREA are collinear. Moreover, the Pr(>|t|) of POPN shows an extremely low level of significance. However, if we omit POPN, the model has only two independent variables. The decision should be made after the summary and plots are analyzed.
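The collinearity between POPN and AREA can also be checked numerically; a sketch, assuming df has been loaded as above:

```r
# Pairwise correlations of the three remaining predictors; a coefficient close
# to 1 (or -1) between Popn and Area would confirm the multicollinearity
# suggested by the pairs plot and the VIF values
round(cor(df[, c("Popn", "Area", "Commis")]), 3)
```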

Try a new model with PROFIT divided by OUTLETS as the dependent variable, omitting POPN:

RM6 <- lm(Profit_of_outlets ~ Area + Commis, data = df)
summary(RM6)
## 
## Call:
## lm(formula = Profit_of_outlets ~ Area + Commis, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9112 -0.5961  0.0748  0.8306  2.4742 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.21035    0.42121  17.118  < 2e-16 ***
## Area        -0.15688    0.02574  -6.096 1.79e-07 ***
## Commis       2.03285    0.37057   5.486 1.51e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.268 on 48 degrees of freedom
## Multiple R-squared:  0.5527, Adjusted R-squared:  0.534 
## F-statistic: 29.65 on 2 and 48 DF,  p-value: 4.124e-09
plot(RM6)

Comparing RM6 with RM5, the adjusted R-squared improves (0.534 vs. 0.5242) and all remaining variables are highly significant, so we conclude POPN should be omitted.
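Since RM6 is nested inside RM5, the decision can also be backed by a partial F-test; a sketch, assuming both models have been fitted as above:

```r
# Partial F-test: does adding Popn to RM6 significantly improve the fit?
anova(RM6, RM5)

# Information criteria point the same way; the model with lower AIC is preferred
AIC(RM5, RM6)
```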

Part C (25 marks): Locate and, if possible, correct any serious violations of assumptions.

Recall the plots of RM5. Outliers exist at observations 19, 32 and 47, of which 19 has the most influence on the model.

plot(RM5)
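The visual impression of influence can be quantified with Cook's distance; a sketch, assuming RM5 and df are defined as above:

```r
# Flag observations whose Cook's distance exceeds the common 4/n rule of thumb;
# highly influential rows (such as observation 19) should appear here
cd <- cooks.distance(RM5)
which(cd > 4 / nrow(df))
```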

Thus, we remove observation 19 and store the result in a new data frame.

df_regression <- df[-(19), ]
df_regression <- subset(df_regression, select=c(Dist,Area,Commis, Profit_of_outlets))

Lastly, we fit and summarize the revised model:

Model_regression <- lm(Profit_of_outlets ~ Area + Commis, data = df_regression)
summary(Model_regression)
## 
## Call:
## lm(formula = Profit_of_outlets ~ Area + Commis, data = df_regression)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.00856 -0.57344 -0.05051  0.66923  2.41762 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.36532    0.35043  21.018  < 2e-16 ***
## Area        -0.16997    0.02149  -7.908 3.50e-10 ***
## Commis       2.21934    0.30944   7.172 4.48e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.05 on 47 degrees of freedom
## Multiple R-squared:  0.6787, Adjusted R-squared:  0.665 
## F-statistic: 49.64 on 2 and 47 DF,  p-value: 2.584e-12
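As a usage sketch, the final model can predict profit per outlet for a new district; the Area and Commis values below are hypothetical, chosen only for illustration:

```r
# Predicted profit per outlet, with a 95% prediction interval, for a
# hypothetical district (illustrative values, not taken from the dataset)
new_district <- data.frame(Area = 10, Commis = 1)
predict(Model_regression, newdata = new_district, interval = "prediction")
```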

Part D (20 marks): Once you have completed Parts A, B, and C, please write a brief report (not to exceed 500 words) to the marketing managers and explain your findings; the managers are not familiar with the technical language of statistics, although they do have an idea what a standard deviation is.

Dear marketing managers,

In order to better evaluate the field sales representatives' performance and how the suspected variables affect it, regression models were built and analyzed. The findings and suggestions are as follows.

In the original model (RM1), 'Profit' is used straightforwardly as the dependent variable, with 'Popn', 'Area', 'Commis' and 'Outlets' all included. The summary table shows that 'Commis' is the most significant variable, while 'Popn' and 'Outlets' contribute very little to 'Profit'. The results indicate that sales representatives are motivated a great deal by compensation: with all other variables equal, a full-commission representative generates about 316.7 more profit than a partially salaried one. By contrast, each unit increase in 'Area' decreases 'Profit' by only about 22.96. Moreover, the Pr(>|t|) values show a low level of significance for 'Popn' and 'Outlets'. The model indicates that 62.15% of the variance in the dependent variable is predictable from the independent variables, which is reasonable.

In analyzing the combination effect of 'Commis' and 'Outlets', we find that with both included, the models fit the data better. Moreover, in RM1, the estimates of 'Outlets' and 'Commis' are 0.8301 and 316.7449, both larger in absolute value than the corresponding estimates in RM2 and RM3 (-0.4619 and 310.156). We can therefore conclude that the two variables do have a combination effect.

As mentioned in the question, 'Profit' may be inflated for representatives with many outlets. Changing the dependent variable to profit per outlet, we ran another regression. The plots and parameters indicate that it is more reasonable to evaluate the performance of sales representatives using profit per outlet, that is, 'Profit' divided by 'Outlets'.

After the dependent variable is determined, we examine all the independent variables and remove those that are not significant. Note that only three variables remain: 'Area', 'Popn' and 'Commis'. First, we examine whether the variables depend on each other; if one variable is highly correlated with another, one of them should be removed. In our case, 'Popn' and 'Area' are highly correlated. Moreover, the analysis above shows that 'Popn' has a low level of significance. Hence, 'Popn' should be removed from the model.

Lastly, the dataset contains outliers. Outliers are unusual or extreme observations that affect the model a great deal. In our case, Dist = 19 is the outlier that should be removed. We then obtain our final model, with profit per outlet as the dependent variable and 'Area' and 'Commis' as independent variables.

To sum up, it is more reasonable to use profit per outlet to evaluate the sales representatives' performance. 'Commis' is the most critical element, while 'Area' should also be considered. Representatives should be offered full commission so that they stay motivated. On the contrary, 'Popn' should not be considered, as it is not significant and is highly correlated with 'Area'. Unusual or extreme observations should be omitted to obtain a fair evaluation.